15 research outputs found

New Techniques for On-line Testing and Fault Mitigation in GPUs

Author: Rodriguez Condia Josie Esteban
Publication venue: country:Italy
Publication date: 24/09/2021

The abstract is in the attachment.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Testing the Divergence Stack Memory on GPGPUs: A Modular in-Field Test Strategy

Author: Rodriguez Condia Josie Esteban
Sonza Reorda M.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2020

General Purpose Graphic Processing Units (GPGPUs) are becoming a promising solution in safety-critical applications, e.g., in the automotive domain. In these applications, reliability and functional safety are relevant factors in the selection of devices to build the systems. Nowadays, many challenges are impacting the implementation of high-performance devices, such as GPGPUs. Moreover, there is the need for effective fault detection solutions to guarantee the correct in-field operation of a GPGPU, such as in the branch management unit, which is one of the most critical modules in this parallel architecture. Faults affecting this structure can heavily corrupt or even collapse the execution of an application on the GPGPU. In this work, we propose a non-invasive Software-Based Self-Test (SBST) solution to detect faults affecting the memory in the branch management unit of a GPGPU. We propose a scalar and modular mechanism to develop the test program as a combination of software functions. The FlexGripPlus model was employed to evaluate the proposed strategies experimentally. Results show that the proposed strategies are effective in testing the target structure and detect up to 98% of permanent faults.

ZENODO

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

An extended model to support detailed GPGPU reliability analysis

Author: Du B.
Reorda M. S.
Rodriguez Condia Josie Esteban
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

General Purpose Graphics Processing Units (GPGPUs) have been used in the last decades as accelerators in highly demanding data processing applications, such as multimedia processing and high-performance computing. Nowadays, these devices are becoming popular even in safety-critical applications, such as autonomous and semi-autonomous vehicles. However, these devices can suffer from the effects of transient faults, such as those produced by radiation. These effects can be represented in the system as Single Event Upsets (SEUs) and can cause intolerable application misbehavior in safety-critical environments. In this work, we extended the capabilities of an open-source VHDL GPGPU model (FlexGrip) in order to study and analyze in much greater detail the effects of SEUs in some critical modules within a GPGPU. Simulation results showed that the scheduler controller has different levels of SEU sensitivity depending on the affected location. Moreover, a reduced number of execution units in the GPGPU can decrease the system reliability.

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Programmers manual FlexGripPlus SASS SM 1.0

Author: Du Boyang
Guerrero Balaguera Juan David
Roascio Gianluca
Rodriguez Condia Josie Esteban
Scie Edouard
Publication date: 01/01/2020

This document describes the op-code of the assembly language SASS of the G80 architecture used in the FlexGripPlus model. Every instruction is compatible with the CUDA Programming environment under the SM_1.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Protecting GPU's Microarchitectural Vulnerabilities via Effective Selective Hardening

Author: Carro Luigi
Fernandes dos Santos Fernando
Rech Paolo
Rodriguez Condia Josie Esteban
Sonza Reorda Matteo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2021

Graphics Processing Units (GPUs) are today adopted in several domains for which reliability is fundamental, such as self-driving cars and autonomous machines. Unfortunately, on one side GPUs have been shown to have a high error rate and, on the other side, the constraints imposed by real-time safety-critical applications make traditional, costly, replication-based hardening solutions inadequate. This paper proposes an effective microarchitectural selective hardening of GPU modules to mitigate those faults that affect the correct execution of instructions. We first characterize, through Register-Transfer Level (RTL) fault injections, the architectural vulnerabilities of a GPU model (FlexGripPlus). We specifically target transient faults in the functional units and pipeline registers of a GPU core. Then, we apply selective hardening by triplicating the locations in each module that we found to be more critical. The results show that selective hardening using Triple Modular Redundancy (TMR) can correct 85% to 99% of faults in the pipeline registers and from 50% to 100% of faults in the functional units. The proposed selective TMR strategy reduces the hardware overhead by up to 65% when compared with traditional TMR.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Revealing GPUs Vulnerabilities by Combining Register-Transfer and Software-Level Fault Injection

Author: Carro Luigi
Fernandes dos Santos Fernando
Rech Paolo
Rodriguez Condia Josie Esteban
Sonza Reorda Matteo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2021

The complexity of both hardware and software makes GPU reliability evaluation extremely challenging. A low-level fault injection on a GPU model, despite being accurate, would take a prohibitively long time (months to years), while software fault injection, despite being quick, cannot access critical resources for GPUs and typically uses synthetic fault models (e.g., single bit-flips) that could result in unrealistic evaluations. This paper proposes to combine the accuracy of Register-Transfer Level (RTL) fault injection with the efficiency of software fault injection. First, on an RTL GPU model (FlexGripPlus), we inject over 1.5 million faults in low-level resources that are unprotected and hidden to the programmer, and characterize their effects on the output of common instructions. We create a pool of possible fault effects on the operation output based on the instruction opcode and input characteristics. We then inject these fault effects, at the application level, using an updated version of a software framework (NVBitFI). Our strategy reduces the fault injection time from the tens of years an RTL evaluation would need to tens of hours, thus making it possible, for the first time on GPUs, to track the fault propagation from the hardware to the output of complex applications. Additionally, we provide a more realistic fault model and show that single bit-flip injection would underestimate the error rate of six HPC applications and two convolutional neural networks by up to 48% (18% on average). The RTL fault models and the injection framework we developed are made available in a public repository to enable third-party evaluations and ease the reproducibility of results.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Characterizing a Neutron-Induced Fault Model for Deep Neural Networks

Author: Guerrero Balaguera Juan David
Rodriguez Condia Josie Esteban
Fernandes dos Santos Fernando
Kritikakou Angeliki
Rech Paolo
Sentieys Olivier
Sonza Reorda Matteo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 22/11/2022

The reliability evaluation of Deep Neural Networks (DNNs) executed on Graphic Processing Units (GPUs) is a challenging problem since the hardware architecture is highly complex and the software frameworks are composed of many layers of abstraction. While software-level fault injection is a common and fast way to evaluate the reliability of complex applications, it may produce unrealistic results since it has limited access to the hardware resources and the adopted fault models may be too naive (i.e., single and double bit flip). In contrast, physical fault injection with a neutron beam provides realistic error rates but lacks fault propagation visibility. This paper proposes a characterization of the DNN fault model combining both neutron beam experiments and fault injection at software level. We exposed GPUs running General Matrix Multiplication (GEMM) and DNNs to beam neutrons to measure their error rate. On DNNs, we observe that the percentage of critical errors can be up to 61%, and show that ECC is ineffective in reducing critical errors. We then performed a complementary software-level fault injection, using fault models derived from RTL simulations. Our results show that by injecting complex fault models, the YOLOv3 misdetection rate is validated to be very close to the rate measured with beam experiments, which is 8.66× higher than the one measured with fault injection using only single-bit flips.

INRIA a CCSD electronic archive server

A New Method to Generate Software Test Libraries for In-Field GPU Testing Resorting to High-Level Languages

Author: Guerrero-Balaguera Juan-David
Rodriguez Condia Josie Esteban
Sonza Reorda Matteo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2022

Self-Test Libraries (STLs) are widely used by companies for in-field fault detection in CPU devices. Their usage is now extending to GPUs, due to their increasing adoption in safety-critical applications. Using STLs provided by GPU manufacturers, system companies can effectively test these devices during their operative life, as required by functional safety standards. In the automotive domain, GPUs are often used to process a high amount of sensitive information in real-time (e.g., object recognition and path tracking). Thus, GPU devices in this field must guarantee functional safety features (e.g., ISO26262) by using one or more functional safety mechanisms. This paper presents a methodology to develop STLs resorting to High-Level Languages (HLLs) (e.g., CUDA), reducing the complexity of encoding at the assembly level. Moreover, we describe the main advantages and discuss the challenges and constraints when developing STLs with HLLs for GPUs. In particular, we describe those cases that demand the usage of a Low-Level Language (LLL). Additionally, we highlight a method to develop STLs resorting to HLLs, at least for some modules. The FlexGripPlus GPU model was employed to evaluate and validate the proposed strategies experimentally. The results show that STLs based on HLLs can be effectively developed for regular modules in the GPU

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

About the functional test of the GPGPU scheduler

Author: Du B.
Rodriguez Condia Josie Esteban
Sonza Reorda M.
Sterpone L.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 04/07/2019

General Purpose Graphical Processing Units (GPGPUs) are increasingly used in safety-critical applications such as automotive ones. Hence, techniques are required to test them during the operational phase with respect to possible permanent faults arising when the device is already deployed in the field. Functional tests adopting Software-Based Self-Test (SBST) are an effective solution since they provide benefits in terms of intrusiveness, flexibility and test duration. While the development of the functional test code addressing the several computational cores composing a GPGPU can be done resorting to known methods developed for CPUs, effective solutions are still missing for other modules that are typical of a GPGPU. This paper focuses on one of the most relevant of these modules, the scheduler core, which is in charge of managing the different scalar computational cores and the different executed threads. At first, we propose a method for evaluating the fault coverage that can be achieved using an application program. Then, we provide some guidelines for improving the achieved fault coverage. Experimental results are provided on an open-source VHDL model of a GPGPU.

Crossref

ZENODO

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

A Novel Compaction Approach for SBST Test Programs

Author: Josie Esteban Rodriguez Condia
Juan David Guerrero Balaguera
Matteo Sonza Reorda
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2021

In-field test of processor-based devices is a must when considering safety-critical systems (e.g., in robotics, aerospace, and automotive applications). During in-field testing, different solutions can be adopted, depending on the specific constraints of each scenario. In recent years, Self-Test Libraries (STLs) developed by IP or semiconductor companies have become widely adopted. Given the strict constraints of in-field test, the size and duration of an STL are crucial parameters. This work introduces a novel approach to compress functional test programs belonging to an STL. The proposed approach is based on analyzing (via logic simulation) the interaction between the micro-architectural operation performed by each instruction and its capacity to propagate fault effects on any observable output, reducing the required fault simulations to only one. The proposed compaction strategy was validated by resorting to a RISC-V processor and several test programs stemming from diverse generation strategies. Results showed that the proposed compaction approach can reduce the length of test programs by up to 93.9% and their duration by up to 95%, with minimal effect on fault coverage.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)